[BugFix] Abort BE vacuum tasks once the FE caller's timeout elapses by starrocks-xupeng · Pull Request #74694 · StarRocks/starrocks

starrocks-xupeng · 2026-06-11T08:23:18Z

Why I'm doing:

The FE gives up waiting for a vacuum RPC after its brpc timeout (LakeService.TIMEOUT_VACUUM, 1 hour), marks the partition vacuum as failed and re-dispatches it shortly after. But the BE-side vacuum task is never cancelled: it keeps running as a zombie, occupying one of the few workers of the RELEASE_SNAPSHOT thread pool (5 by default) for hours, while nobody reads its response. On clusters with partitions that accumulated a huge number of versions, zombie tasks can exhaust the whole pool: newly dispatched vacuum requests pile up in the queue, vacuum throughput collapses cluster-wide, and the version backlog keeps growing.

What I'm doing:

Add optional int64 timeout_ms = 11 to VacuumRequest (gensrc/proto/lake_service.proto): the maximum duration the FE caller waits for the request.
FE: AutovacuumDaemon#vacuumPartitionImpl fills timeoutMs with LakeService.TIMEOUT_VACUUM, the brpc timeout of the vacuum RPC, so the BE deadline matches exactly how long the FE actually waits.
BE: LakeServiceImpl::vacuum anchors an absolute deadline (butil::gettimeofday_ms() + timeout_ms) at the time the request is received and passes it to lake::vacuum (new deadline_ms parameter, default 0 = no deadline, all other callers unchanged).
BE: vacuum_impl checks the deadline once at entry, so a task that already exceeded the deadline while waiting in the thread pool queue aborts without doing any work. collect_files_to_vacuum checks it on each iteration of the version-chain walk (the dominant cost for high-version-count partitions) and aborts with Status::TimedOut, freeing the worker. Aborting between walk iterations leaves the metadata chain untouched, so the next vacuum round resumes from the same state.
BE: new mutable config lake_vacuum_enable_task_timeout (default true) gates the deadline: when set to false the BE ignores timeout_ms and vacuum tasks always run to completion.
Requests without timeout_ms (older FE versions) carry no deadline and run to completion, exactly as before.
UT: three new tests. test_vacuum_deadline_expired_mid_walk: the deadline expires mid-walk via a mocked clock on the new vacuum:check_deadline sync point; nothing is deleted, and a follow-up run without a deadline converges normally. test_vacuum_task_deadline_exceeded: the handler threads timeout_ms into the task and returns TIMEOUT; a request without the field is unaffected, and so is any request when lake_vacuum_enable_task_timeout is off. testVacuumRequestCarriesTimeout: the FE fills the field.
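
The BE-side steps above (anchor an absolute deadline when the request arrives, then check it on every iteration of the walk) can be illustrated with a minimal standalone sketch. It is not the StarRocks implementation: WalkStatus, ClockMs, anchor_deadline_ms, deadline_exceeded and collect_versions are hypothetical names, the clock is injected so a test can mock it, and two details are assumptions not stated in the description: a non-positive timeout is treated as "no deadline", and the deadline counts as exceeded when now >= deadline.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for a status code; not the StarRocks Status class.
enum class WalkStatus { kOk, kTimedOut };

// The clock is injected so that tests can replace it with a mocked one.
using ClockMs = std::function<int64_t()>;

// Anchor an absolute deadline at the time the request is received.
// Returns 0 ("no deadline") when the gate is off or the field is absent.
// Assumption: a non-positive timeout is also treated as "no deadline".
int64_t anchor_deadline_ms(bool has_timeout, int64_t timeout_ms,
                           bool enable_task_timeout, int64_t now_ms) {
    if (!enable_task_timeout || !has_timeout || timeout_ms <= 0) {
        return 0;
    }
    return now_ms + timeout_ms;
}

// Assumption: the deadline is exceeded once now >= deadline.
bool deadline_exceeded(int64_t deadline_ms, int64_t now_ms) {
    return deadline_ms > 0 && now_ms >= deadline_ms;
}

// Walk a version chain and collect its entries, checking the deadline on
// every iteration. On timeout the output is left untouched, so a later
// run starts from the same state.
WalkStatus collect_versions(const std::vector<int64_t>& chain,
                            int64_t deadline_ms, const ClockMs& now,
                            std::vector<int64_t>* out) {
    assert(out != nullptr);
    std::vector<int64_t> collected;
    for (int64_t version : chain) {
        if (deadline_exceeded(deadline_ms, now())) {
            return WalkStatus::kTimedOut;
        }
        collected.push_back(version);
    }
    *out = std::move(collected);
    return WalkStatus::kOk;
}
```

Passing deadline_ms = 0 reproduces the "run to completion" behavior described for requests without timeout_ms and for lake_vacuum_enable_task_timeout = false.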

Fixes #issue

What type of PR is this:

Does this PR entail a change in behavior?

Yes, this PR will result in a change in behavior.
No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

Interface/UI changes: syntax, type conversion, expression evaluation, display information
Parameter changes: default values, similar parameters but with different default values
Policy changes: use new policy to replace old one, functionality automatically enabled
Feature removed
Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

I have added test cases for my bug fix or my new feature
This pr needs user documentation (for new or modified features or behaviors)
- I have added documentation for my new feature or new function
- This pr needs auto generate documentation
This is a backport pr

Bugfix cherry-pick branch check:

I have checked the version labels which the pr will be auto-backported to the target branch
- 4.1
- 4.0
- 3.5

CelerData-Reviewer · 2026-06-11T08:29:06Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9620972e97


The FE gives up waiting for a vacuum RPC after its brpc timeout (1 hour), but the BE task keeps running as a zombie: it occupies one of the few RELEASE_SNAPSHOT workers and races with re-dispatched vacuums of the same partition for hours, while nobody reads its response. Carry the FE timeout in VacuumRequest.timeout_ms. The BE vacuum handler anchors an absolute deadline when the request is received and threads it through the vacuum execution; the version-chain walk checks the deadline on each iteration and aborts with Status::TimedOut once it passes. The check at vacuum entry also kills tasks that already exceeded the deadline while waiting in the thread pool queue. Requests from older FEs without the field carry no deadline and run to completion as before.
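
The effect of the entry check on tasks that expire while still queued can be sketched with a small single-worker simulation. This is an illustration under stated assumptions, not the BE thread pool: QueuedTask, RunResult and drain_queue are hypothetical names, time is a simulated counter rather than a real clock, and a deadline of 0 means "no deadline".

```cpp
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical task record: an absolute deadline (0 = none) and a simulated cost.
struct QueuedTask {
    int64_t id;
    int64_t deadline_ms;
    int64_t cost_ms;
};

// Outcome of draining the queue with one simulated worker.
struct RunResult {
    std::vector<int64_t> ran;
    std::vector<int64_t> aborted_at_entry;
    int64_t finished_at_ms;
};

// Process tasks in FIFO order. A task whose deadline already passed while
// it waited in the queue is aborted at entry and consumes no worker time.
RunResult drain_queue(std::deque<QueuedTask> queue, int64_t start_ms) {
    RunResult result{{}, {}, start_ms};
    int64_t now_ms = start_ms;
    while (!queue.empty()) {
        QueuedTask task = queue.front();
        queue.pop_front();
        if (task.deadline_ms > 0 && now_ms >= task.deadline_ms) {
            result.aborted_at_entry.push_back(task.id);
            continue;
        }
        now_ms += task.cost_ms;  // simulated execution time
        result.ran.push_back(task.id);
    }
    result.finished_at_ms = now_ms;
    return result;
}
```

In this model, an expired task frees the worker immediately, so tasks behind it in the queue start sooner than they would if the expired task ran to completion.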

github-actions · 2026-06-12T02:24:19Z

No new undocumented parameters detected by the param-drift check.

xiangguangyxg · 2026-06-12T05:54:24Z

+            // The longest this FE waits for the response (the brpc timeout of the vacuum RPC).
+            // The BE checks it periodically during execution and aborts the task once it has
+            // elapsed, instead of running on as a zombie that no caller is waiting for.
+            vacuumRequest.timeoutMs = LakeService.TIMEOUT_VACUUM;


Better to make the vacuum timeout configurable.

github-actions · 2026-06-12T06:33:03Z

[Java-Extensions Incremental Coverage Report]

✅ pass : 0 / 0 (0%)

github-actions · 2026-06-12T06:41:19Z

[FE Incremental Coverage Report]

✅ pass : 1 / 1 (100.00%)

file detail

	path	covered_line	new_line	coverage	not_covered_line_detail
🔵	com/starrocks/lake/vacuum/AutovacuumDaemon.java	1	1	100.00%	[]

github-actions · 2026-06-12T06:56:01Z

[BE Incremental Coverage Report]

✅ pass : 20 / 20 (100.00%)

file detail

	path	covered_line	new_line	coverage	not_covered_line_detail
🔵	be/src/service/service_be/lake_service.cpp	4	4	100.00%	[]
🔵	be/src/storage/lake/vacuum.cpp	16	16	100.00%	[]

wanpengfei-git added the PROTO-REVIEW label Jun 11, 2026

github-actions Bot added 4.0 4.1 3.5 labels Jun 11, 2026

wanpengfei-git requested a review from a team June 11, 2026 08:23

mergify Bot assigned starrocks-xupeng Jun 11, 2026

github-actions Bot requested review from meegoo and xiangguangyxg June 11, 2026 08:31

chatgpt-codex-connector Bot reviewed Jun 11, 2026

Comment thread be/src/storage/lake/vacuum.cpp

starrocks-xupeng force-pushed the vacuum_task_timeout branch from 9620972 to efca032 Compare June 11, 2026 09:43

[BugFix] Add lake_vacuum_enable_task_timeout to gate the vacuum task …

736c245

…deadline

xiangguangyxg approved these changes Jun 12, 2026

xiangguangyxg reviewed Jun 12, 2026


starrocks-xupeng wants to merge 2 commits into StarRocks:main from starrocks-xupeng:vacuum_task_timeout

4 participants
